AD-763  7  23 




SPEECH  UNDERSTANDING  SYSTEMS 


James  W.  Forgie 


Massachusetts  Institute  of  Technology 




Prepared  for: 


Electronic  Systems  Division 
Advanced  Research  Projects  Agency 


31  May  1973 


DISTRIBUTED  BY: 


National  Technical  Information  Service 
U.  S.  DEPARTMENT  OF  COMMERCE 

5285  Port  Royal  Road,  Springfield  Va.  22151 




10.  AVAILABILITY/LIMITATION  NOTICES


Approved  for  public  release;  distribution  unlimited. 


12.  SPONSORING  MILITARY  ACTIVITY 

Advanced  Research  Projects  Agency, 
Department  of  Defense 


13.  ABSTRACT

The phoneme class segmentation and formant tracking algorithms have been extended and evaluated
in some detail. Encouraging results have been obtained from a simple phoneme identification
program working on the output of the segmentation program.

The sentence generation program has been extended to produce questions and statements as well as
commands appropriate to the task domain of the Lincoln experimental speech understanding system.
A semantic component has been added to the heuristic search program used in processing utterances
in this task domain (the vocal command of the speech data retrieval, analysis, and display system).

A General Syntax Program has been implemented to support work on a variety of parsing strategies.
A doctoral thesis on locally organized parsing for spoken input has been completed.

The speech data base hardware and software have been actively used in supporting workshop meetings
among the ARPA Speech contractors. Encouraging results are being achieved with an automatic
phonetic labeling program. Network retrieval of data from the data base has been demonstrated.

The TX-2 system has been extended to provide support for the various current areas of activity.
A new disk system has been delivered and is being installed. A user authentication (password)
scheme has been developed which is well suited to the open environment of the TX-2 system where
the usual mechanisms for hiding user identification information are not available.


14.  KEY WORDS

speech understanding systems
linear predictive coding (LPC)
phonetic recognition
TELNET
LPARS
SURNET
TX-2 System


11.  SUPPLEMENTARY NOTES
None


MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 
LINCOLN  LABORATORY 


SPEECH  UNDERSTANDING  SYSTEMS 

SEMIANNUAL  TECHNICAL  SUMMARY  REPORT 

TO  THE 

ADVANCED  RESEARCH  PROJECTS  AGENCY 


1  DECEMBER  1972  -  31  MAY  1973 

ISSUED  2  JULY  1973 


Approved  for  public  release;  distribution  unlimited.


LEXINGTON


MASSACHUSETTS


The work reported in this document was performed at Lincoln Laboratory,
a center for research operated by Massachusetts Institute of Technology.
This  work  was  sponsored  by  the  Advanced  Research  Projects  Agency  of  the 
Department of Defense under Air Force Contract F19628-73-C-0002 (ARPA
Order 2006).

This report may be reproduced to satisfy needs of U.S. Government agencies.


Non-Lincoln  Recipients

PLEASE  DO  NOT  RETURN 

Permission  is  given  to  destroy  this  document 
when  it  is  no  longer  needed. 


SUMMARY


The phoneme class segmentation and formant tracking algorithms have been extended
and evaluated in some detail. Encouraging results have been obtained from a simple
phoneme identification program working on the output of the segmentation program.

The sentence generation program has been extended to produce questions and statements
as well as commands appropriate to the task domain of the Lincoln experimental
speech understanding system. A semantic component has been added to the
heuristic search program used in processing utterances in this task domain (the
vocal command of the speech data retrieval, analysis, and display system).

A General Syntax Program has been implemented to support work on a variety of
parsing strategies. A doctoral thesis on locally organized parsing for spoken input
has been completed.

The speech data base hardware and software have been actively used in supporting
workshop meetings among the ARPA Speech contractors. Encouraging results are
being achieved with an automatic phonetic labeling program. Network retrieval of
data from the data base has been demonstrated.

The TX-2 system has been extended to provide support for the various current areas
of activity. A new disk system has been delivered and is being installed. A user
authentication (password) scheme has been developed which is well suited to the open
environment of the TX-2 system where the usual mechanisms for hiding user
identification information are not available.


Accepted  for  the  Air  Force 
Joseph  J.  Whelan,  USAF

Acting  Chief,  Lincoln  Laboratory  Liaison  Office 


CONTENTS


Summary
Glossary

I.  PHONETIC RECOGNITION
    A.  Segmentation
    B.  Formant Tracking
    C.  Phoneme Identification

II.  LINGUISTICS
    A.  Lincoln Experimental System
    B.  General Syntax Program
    C.  LPARS: A Locally Organized PARSer for Spoken Input

III.  SPEECH DATA BASE
    A.  Speech Input/Output Hardware
    B.  Automatic Labeling
    C.  Data-Base Software
    D.  SURNET

IV.  SYSTEM ACTIVITIES
    A.  TSP System
    B.  TX-2 System
    C.  TX-2 Password Scheme




GLOSSARY


APEX      TX-2 time-sharing system

ARPA      Advanced Research Projects Agency

BCPL      Basic Combined Programming Language - intermediate-level
          language for computer programming

FDP       Fast Digital Processor - Lincoln Laboratory computer
          designed for waveform processing applications

GSP       General Syntax Program - a software system to support natural
          language processing

LPC       Linear Predictive Coding - method for signal analysis being
          used in current speech research

SATS      Semiannual Technical Summary Report

SURNET    Specialized server process intended to provide network access
          to speech data base on TX-2

TELNET    Software which allows console on one network computer to
          function as console for another

TSP       Terminal Support Processor


SPEECH  UNDERSTANDING  SYSTEMS


I.  PHONETIC  RECOGNITION

Work in phonetic recognition is proceeding along two rather different lines. The first is
concerned with the front-end problem of extracting basic information from the speech signal,
given little or nothing in the way of linguistic support for the recognition process. The goal of
the front end is to produce something like a phonemic transcription of the input speech which can
serve as an input to linguistic processing modules. The expectation is that such a transcription
will, at best, be an approximation to a nominal transcription made by a human listener. While
significant amounts of error and ambiguity in such a transcription are to be expected and can
probably be tolerated, the transcription must be accurate enough to guide the linguistic processing
modules toward the hypothesis of the correct sentence in a reasonable amount of time.

The bulk of our phonetic-recognition work to date has been concerned with the front-end
problem. We have developed and previously described algorithms for phoneme class segmentation
and formant tracking. We are currently working on the identification of individual phonemes
both by examining overall frequency-time patterns and by detailed examination of formant motions.

The second line of work in phonetic recognition is aimed at the verification of hypotheses
generated on the basis of linguistic as well as acoustic information. The problem here may be
to determine which of a set of proposed words, phrases, sentences, or whatever is the best fit
to the input data; or the problem may be to determine whether a particular hypothesis is acceptable
or not with respect to some arbitrary measure of goodness of fit. The verification process
must expect to have to cope on occasion with minimal pair phonemic decisions and the effects of
coarticulation, and it thus must make finer, more sophisticated judgments than are required of
front-end recognition processes. On the other hand, the set of candidates to be discriminated
among will generally be much smaller than the full set of possibilities which the front end must
face. A previous study of the interaction between acoustic discriminations and linguistic
constraints suggested that the human listener adjusts the measurement space for the discrimination
according to the set of alternatives suggested by the linguistic context. We feel that a speech
understanding system will need to make similar use of context to make similar discriminations.

Our current work in the verification area is aimed at developing a practical means of
defining an appropriate discriminant space given a limited set of candidate words. Since the set
of candidate words is different each time a verification is to be made, the recognition space will
be different and, hopefully, the maximum help will be gained from the context available at that
moment. A simplified experiment has been planned to test the potential utility of this concept.
Appropriate sentences have been recorded and entered into the speech data base. Further
discussion of the experiment will be deferred until the next report in this series when results should
be available.

The following sections indicate the current state of development and evaluation of the
previously described phoneme class segmentation and formant tracking algorithms, and present
some results from the phoneme-identification research.

A.  Segmentation 

Considerable development and data collection have been done on the segmentation algorithm.
The most significant addition to the algorithm since the last SATS² is a generalized dip detector,
which allows detection and marking of dips in any of several volume (root-mean-square amplitude)
functions during vowel-like segments. This generalized dip detector has considerably improved
the detection of voiced consonants which are either between vowels or between a vowel and a
strong fricative. Another addition to the segmentation algorithm is a burst detector, designed
to locate the burst following a stop consonant.

Fig. 1. Phoneme class segmentation algorithm. RMS(A-B) = root mean square
of sum of squares of spectral components in range A-B.
Output labels: VWL; DIP1 (0-5000, LSM); DIP2 (640-2800, LSM); DIP3 (0-5000, HSM);
DIP4 (640-2800, HSM); ASP; FRI; VFR; FRC; BST; VBR; SIL.
LSM = light smoothing; HSM = heavy smoothing.

Fig. 2. Confusion matrix - phoneme class segmentation results for 38 sentences.
Rows (input phonemes): vowels; nasals, glides; fricatives (F, TH, S, Z, SH, ZH);
stops, affricates; flapped T; H; V, DH.
Columns (segment labels): VWL; DIP; VFR, FRC, FRI; SIL, VBR; BST; ASP.

The current segmentation algorithm is shown in Fig. 1. The indicated spectral measurements
are performed on the linear predictive spectrum (previously, the homomorphic spectrum was
used). Every 5 msec, one of the 12 segmentation symbols is assigned to the speech. (Note that
the updating rate was previously every 6.4 msec.) A string of like symbols is referred to as a
segment of a particular class. Considerable editing is done to eliminate unduly short segments.
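
The grouping of per-frame symbols into segments can be sketched as follows. This is only an illustration: the report does not give its editing rules, so the `min_ms` threshold and the rule of absorbing a short segment into its predecessor are assumptions, and the function names are hypothetical.

```python
from itertools import groupby

FRAME_MS = 5  # one segmentation symbol every 5 msec, as stated in the text


def frames_to_segments(labels):
    """Collapse a per-frame label sequence into (label, start_ms, duration_ms) runs."""
    segments = []
    t = 0
    for label, run in groupby(labels):
        n = len(list(run))
        segments.append((label, t, n * FRAME_MS))
        t += n * FRAME_MS
    return segments


def drop_short_segments(segments, min_ms=15):
    """Assumed editing rule: absorb an unduly short segment into the one before it."""
    edited = []
    for label, start, dur in segments:
        too_short = dur < min_ms
        same_as_prev = bool(edited) and edited[-1][0] == label
        if edited and (too_short or same_as_prev):
            # Extend the previous segment instead of starting a new one.
            prev_label, prev_start, prev_dur = edited[-1]
            edited[-1] = (prev_label, prev_start, prev_dur + dur)
        else:
            edited.append((label, start, dur))
    return edited


frames = ["SIL"] * 4 + ["VWL"] * 6 + ["FRC"] * 1 + ["VWL"] * 5
print(drop_short_segments(frames_to_segments(frames)))
```

With the sample frames, the single 5-msec FRC frame is absorbed and the two VWL runs merge into one segment.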

The generalized dip detector processes all vowel-like segments. First, dips are sought in
a lightly smoothed version of RMS(0-5000)*; these detected dips are marked as DP1. Remaining
vowel-like segments are then successively examined for dips in the three other volume functions.
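
A minimal sketch of dip detection in a volume function is given below. The report does not state its dip criterion, so the depth test, the `min_depth` parameter, the three-point smoother, and all function names are assumptions made for illustration.

```python
def smooth(samples, passes=1):
    """Three-point (1/4, 1/2, 1/4) smoothing; more passes means heavier smoothing."""
    y = list(samples)
    for _ in range(passes):
        if len(y) < 3:
            break
        inner = [0.25 * y[i - 1] + 0.5 * y[i] + 0.25 * y[i + 1]
                 for i in range(1, len(y) - 1)]
        y = [y[0]] + inner + [y[-1]]
    return y


def find_dips(volume, min_depth):
    """Indices of local minima lying at least min_depth below the peaks on both sides."""
    dips = []
    for i in range(1, len(volume) - 1):
        if volume[i] <= volume[i - 1] and volume[i] < volume[i + 1]:
            left_peak = max(volume[:i])
            right_peak = max(volume[i + 1:])
            if min(left_peak, right_peak) - volume[i] >= min_depth:
                dips.append(i)
    return dips


def label_dips(volume_functions, min_depth):
    """Try each (label, samples) volume function in order; the first to fire on a frame wins."""
    marks = {}
    for label, samples in volume_functions:
        for i in find_dips(samples, min_depth):
            marks.setdefault(i, label)
    return marks
```

Passing the volume functions in the order DP1 to DP4 mirrors the cascade described above: a frame already marked by an earlier function keeps its label.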

The vowel-like segments which are not DIPs are marked as VWL. The segment classes
FRI, VFR, and FRC are intended to locate the fricative sounds /S/, /Z/, /SH/, /ZH/, /F/, and
/TH/.† The classes SIL and VBR are intended to locate closures in stop consonants. BST and
ASP are useful in locating the burst and aspiration following a stop consonant.

A summary of some results obtained from the segmentation program is shown in Fig. 2.
These results were obtained on 38 sentences, from several speakers. The sentences were
phonemically labeled by observation of spectrograms and knowledge of what was spoken. The
correspondence of phonemes to segment labels which was observed and entered into the matrix was
not 1:1 in either direction. A phoneme could have more than one segment (e.g., STOP -> SIL,
BST, ASP or an insertion error like VOWEL -> VWL, DIP, VWL) and a segment could correspond
to more than one phoneme (e.g., FRIC, FRIC -> FRC). A separate tabulation of all non-1:1 cases
has been made. In tabulating the matrix, it was important to judiciously group both the phonemes
and segment classes. For example, the phonemes /V/, /DH/ are separated in the tabulation
from the other fricatives because they appear acoustically more like DIPs or stops. The
flapped /T/ has been put in a separate category from the other stops, because its acoustic
realization is generally as a short DIP rather than as a short silence. Nasals and glides which were
adjacent to stops were not included in this data tabulation, since the algorithm does not yet have
a strategy for segmenting these phonemes.

A number of interesting observations, which are suggestive of the course of further work,
may be made from this matrix. All vowels in the corpus were marked at least during part of
their duration by VWL and, conversely, all VWL segments contained at least one vowel. Hence,
the job of vowel location is in hand, and we can proceed with vowel identification. Detection of
the fricatives /F/, /TH/, /S/, /Z/, /SH/, and /ZH/ was quite reliable and this group of sounds
seemed to be very promising for separation into classes which are more phoneme-like.
Section I-C reports our work on identifying these segments. STOP detection also was rather
satisfactory. The dip detector found a total of 83 dips, of which 53 were DP1 and the rest were DP2,
DP3, or DP4. The DIP segments included many different phonemes. Some work on further
identification of DIPs is also reported below. The phonemes which spread most widely over the
segment classes are /V/ and /DH/.

* RMS(A-B) = root mean square of the sum of the squares of the spectral components in the
range A-B.

† In this report, phoneme symbols are represented in the computer-compatible two-character
code which has been adopted for use in the ARPA Speech Understanding Research program.




B.  Formant  Tracking

The formant tracking algorithm, described in detail in the previous SATS,² has been slightly
improved and thoroughly evaluated. The algorithm now keeps statistics on how often it had to
perform its various correction measures and also generates a code word describing what sort
of problems it encountered for each frame. The formant trajectories are being thoroughly studied
to determine how the phonemes can be further segmented and identified on the basis of formant
information.

Following is a brief review of the six steps in the processing of each frame:

(1) Fetch all peaks (up to 4) in the spectrum in the frequency region from 150 to 3400 Hz.

(2) Fill all 4 formant slots with the peaks on the basis of peak location relative to an
educated guess.

(3) If a peak fills more than one slot, keep it only in the slot that it fills best, and
remove it from any other slots.

(4) If there exists a peak which does not fill any slot as yet, try to find an empty slot
and, if necessary, move peaks to new slots to accommodate it. If there are no
empty slots nearby, or if the amplitude of the extra peak is sufficiently small,
throw the extra peak away.

(5) If there is an empty slot, recompute the spectrum on a circle of radius less than
one, to enhance the peaks, and hopefully separate two merged peaks or bring out
a peak which had been lost due to nasalization effects. Repeat steps (1) to (5) using
the enhanced spectrum.

(6) Accept formant slot contents as answers for this frame and as an educated guess
for the next frame.
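
Steps (2) and (3) can be illustrated with the sketch below. It is not the report's implementation: the acceptance distance `max_dist_hz`, the nearest-peak rule, and the function name are assumptions, and steps (4) to (6) are not modeled.

```python
def fill_slots(peaks_hz, guesses_hz, max_dist_hz=300.0):
    """Assign spectral peaks to formant slots by nearness to each slot's educated guess."""
    # Step (2): every slot takes the nearest peak within max_dist_hz of its guess.
    slots = [None] * len(guesses_hz)
    for s, guess in enumerate(guesses_hz):
        near = [p for p in peaks_hz if abs(p - guess) <= max_dist_hz]
        if near:
            slots[s] = min(near, key=lambda p: abs(p - guess))

    # Step (3): a peak held by several slots stays only where it fits best.
    for peak in {p for p in slots if p is not None}:
        holders = [s for s, p in enumerate(slots) if p == peak]
        best = min(holders, key=lambda s: abs(peak - guesses_hz[s]))
        for s in holders:
            if s != best:
                slots[s] = None
    return slots


# Two peaks, four slots: the 600 Hz peak is claimed by slots 0 and 1 but kept only in slot 0.
print(fill_slots([600.0, 2400.0], [500.0, 800.0, 2500.0, 3300.0]))
```

Slots left as `None` are the empty slots that would trigger the enhancement of step (5).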

The only modification made in the processing of each frame is that now enhancement is tried
over a larger range of radii. It was found that, if the radius was allowed to become as small as
0.88 (previously it had stopped at 0.955), then enhancement is far more successful at retrieving
missing peaks.

After the six steps of processing have been applied at each voiced frame, each formant
track is separately smoothed. First, a correction for gross errors is made: if one, two, or
three frames are out of line but the surrounding four frames are in line, then each unaligned
frame is corrected by interpolation. Finally, the formant trajectory is sent twice through the
following smoothing filter:

$$F_i'(n) = \tfrac{1}{4} F_i(n-1) + \tfrac{1}{2} F_i(n) + \tfrac{1}{4} F_i(n+1).$$

However, $F_i(n)$ is not replaced by $F_i'(n)$ if $|F_i'(n) - F_i(n)| > 100$ Hz. The result is that the
trajectory becomes very smooth where it was already reasonably smooth, but sharp changes
in formant frequency, such as at the boundary between a nasal and a vowel, are retained. In
addition, if the onset of voicing had been inaccurately determined, such that the formant data
were badly out of line for a couple of frames, such erroneous data would not distort the good data
adjacent to it, as the smoothing would not be applied there. This rule for not replacing $F_i(n)$
with the smoothed value takes effect on about 3 percent of the voiced frames.
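
The smoothing rule can be sketched as below. The filter weights, the 100-Hz limit, and the two passes come from the text; the function name and the choice to leave the two end frames untouched are assumptions.

```python
def smooth_track(track_hz, max_shift_hz=100.0, passes=2):
    """Apply the (1/4, 1/2, 1/4) filter, skipping frames it would move by more than max_shift_hz."""
    out = list(track_hz)
    for _ in range(passes):
        prev = out
        out = list(prev)
        for n in range(1, len(prev) - 1):
            smoothed = 0.25 * prev[n - 1] + 0.5 * prev[n] + 0.25 * prev[n + 1]
            if abs(smoothed - prev[n]) <= max_shift_hz:
                out[n] = smoothed
    return out


# A small wiggle is smoothed; the 500 -> 1500 Hz jump is preserved.
track = [500.0, 500.0, 520.0, 500.0, 500.0, 1500.0, 1500.0, 1500.0]
print(smooth_track(track))
```

In the example, the frames on either side of the jump would each move by 250 Hz, so the guard leaves them unchanged and the sharp transition survives both passes.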




Statistics were collected for some 60 sentences on how often the various correction measures
were necessary. The statistics showed that it is much more common in linear prediction spectra
for a peak to be missing than for a spurious peak to exist. Although statistics varied considerably
from sentence to sentence, enhancement was tried, on the average, in about 15 percent of
the voiced frames. Out of these, a peak was found through enhancement in nine out of ten cases.
In the other one-tenth, either the frame was mistakenly labeled voiced, or the formant was too
strongly canceled by a nearby zero (in nasals and nasalized vowels) or, rarely, a peak merger
was not successfully resolved.

In 1 percent of the voiced frames, a formant was initially assigned to the wrong slot, and was
later moved to the correct slot after enhancement failed to yield a peak. In another 3 percent,
continuity constraints had failed in the initial slot-filling steps, and peaks had to be moved to
new slots in step (4) to accommodate a peak about to be thrown away. An equal number of peaks
were thrown away in step (4) either because they failed to pass the amplitude test or because
there was no slot available for them. These extra peaks were usually due to nasalization effects.

Second-pass corrections of gross errors were rare, occurring only about 1.5 percent of
the time.

C.  Phoneme  Identification

Programs  have  been  developed  to  yield  phoneme-like  identification  of  speech  events  which 
have  been  isolated  as  fricative  or  dip  segments. 

The design of the fricative identification program is based on the assumption that a fricative
segment contains one of the phonemes /S/, /Z/, /SH/, /ZH/, /F/, /TH/. This assumption is
supported by the statistics appearing in the "VFR, FRC, FRI" column of Fig. 2. After several
stages of experimentation, the fricative program emerged in the following very simple form:
RMS(0-2500)/RMS(0-5000) is used to distinguish the weak voiceless fricatives /F/, /TH/ from
the rest (high for /F/, /TH/); RMS(3500-5000)/RMS(2000-5000) is used to distinguish /S/, /Z/
from /SH/, /ZH/ (high for /S/, /Z/). The present program makes no attempt to distinguish
between voiced and unvoiced phonemes (e.g., /S/ vs /Z/).
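
The two-ratio decision can be sketched as follows. The band edges are those given in the text; the threshold values, the spectrum representation as (frequency, magnitude) pairs, and the function names are assumptions, since the report does not state them.

```python
import math


def band_rms(spectrum, lo_hz, hi_hz):
    """RMS of the spectral magnitudes whose frequency lies in [lo_hz, hi_hz]."""
    vals = [m for f, m in spectrum if lo_hz <= f <= hi_hz]
    return math.sqrt(sum(m * m for m in vals) / len(vals)) if vals else 0.0


def band_ratio(spectrum, num_band, den_band):
    den = band_rms(spectrum, *den_band)
    return band_rms(spectrum, *num_band) / den if den > 0 else 0.0


def classify_fricative(spectrum, weak_threshold=1.1, sibilant_threshold=1.0):
    """Return 'FTH', 'SZ', or 'SHZH'; both thresholds are illustrative guesses."""
    if band_ratio(spectrum, (0, 2500), (0, 5000)) > weak_threshold:
        return "FTH"   # relatively more low-band energy: weak fricatives /F/, /TH/
    if band_ratio(spectrum, (3500, 5000), (2000, 5000)) > sibilant_threshold:
        return "SZ"    # energy concentrated above 3500 Hz
    return "SHZH"


freqs = range(0, 5001, 500)
s_like = [(f, 10.0 if f >= 3500 else 1.0) for f in freqs]
print(classify_fricative(s_like))
```

Like the program described above, the sketch makes no voiced/unvoiced distinction; it only separates the three label groups.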

The dip-identification program is based on the assumption that the primary phonemes
occurring as dips are the nasals, liquids, glides, flapped /T/, /V/, /DH/, and /B/ (pronounced without
strong lip closure). This assumption is supported by the statistics in the DIP column of Fig. 2.

The identification of flapped /T/ and /B/, /V/, /DH/ is based primarily on the use of duration
information, and the behavior of the spectral derivative in the middle two-thirds of the segment.
The distinction between flapped /T/ and (/B/, /V/, /DH/) is based exclusively on duration.

The  results  of  these  programs  are  summarized  in  Figs.  3  and  4.  The  input  used  for  testing 
consisted  of  60  sentences  spoken  by  six  speakers. 

The performance of the fricative program was very good, as can be seen from the dominance
of the main diagonal in Fig. 3.

The performance of the dip program is mixed (see Fig. 4). The identification of flapped /T/
is quite successful (13/14). The identification of /B/, /V/, /DH/ is only 5 out of 10; note,
however, that no other phoneme is ever mislabeled BVDH, out of a total of 79 dips. Not surprisingly,
it has been found to be quite difficult to accomplish good separation of nasals from liquids only
on the basis of spectral shape or spectral change. In particular, large spectral change at
segment boundaries is completely ambiguous as a nasal-liquid cue. We feel that the use of formant
tracks holds the best promise for success.




                      IDENTIFICATION LABELS
INPUT PHONEMES        SZ      SHZH     FTH
(S, Z)                62        0        0
(SH, ZH)               0        7        1
(F, TH)                2        0       29

Fig. 3. Confusion matrix - fricative-identification program.


                      IDENTIFICATION LABELS
INPUT PHONEMES        FLAPT   BVDH     NASLIQ
(flapped T)           13        0        1
(B, V, DH)             0        5        5
(NAS, LIQ)             2        0       53

Fig. 4. Confusion matrix - dip-identification program.
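
As a worked check on the two confusion matrices, the diagonal sums can be tallied as below. The matrix entries are transcribed from Figs. 3 and 4; the function name is hypothetical.

```python
def accuracy(matrix):
    """Return (correct, total, fraction correct) for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total, correct / total


# Rows are input phoneme groups, columns are identification labels.
FRICATIVES = [[62, 0, 0], [0, 7, 1], [2, 0, 29]]   # Fig. 3: (S,Z), (SH,ZH), (F,TH)
DIPS = [[13, 0, 1], [0, 5, 5], [2, 0, 53]]         # Fig. 4: flapped T, (B,V,DH), (NAS,LIQ)

print(accuracy(FRICATIVES))  # 98 of 101 correct
print(accuracy(DIPS))        # 71 of 79 correct
```

The dip total of 79 agrees with the figure quoted in the text, and the BVDH column sum of 5 confirms that no other phoneme group was given that label.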




II.  LINGUISTICS 


At Lincoln Laboratory, work in the linguistics area falls into three general areas:

(a) Work oriented toward the exploration of acoustic/linguistic tradeoffs in the context
of the Lincoln experimental system task domain. This task is the vocal command
of our speech data retrieval, analysis, and display system. Section II-A presents
a progress report on our work in this area.

(b) The development of a General Syntax Program (GSP) to support more general
parsing strategies than those required for our limited experimental task domain.
Section II-B is a presentation of the concepts and capabilities of GSP as we have
implemented it.

(c) Thesis work aimed at the exploration of other approaches to the design of speech
understanding systems. One of the two doctoral thesis projects being supported
on TX-2 was completed during this reporting period. The concepts involved in
this thesis and the results of experimental evaluation are presented briefly in
Sec. II-C. The program developed for the other thesis project has been successfully
used in an experiment to integrate a complete speech understanding system by
linking our current primitive phonetic-recognition modules with the dictionary
lookup, scoring, and parsing modules of the thesis program. The resulting system
correctly recognized a few sentences but, as we expected, performance was poor
because of the limited capability of the front end (phoneme class information only),
and no attempt is being made to pursue the experiment further until a more capable
front end is available. Further discussion of this experiment and the second thesis
will be deferred until completion of the thesis project.

A.  Lincoln  Experimental  System 

We have planned a series of experiments to gain quantitative data about the information
available for correcting errors and resolving ambiguities at the various levels of linguistic
processing. To support the experiments, we are developing a number of program modules, some
of which may serve as components of our experimental speech understanding system. In the
previous SATS,² we described programs for generating sentences appropriate to our task domain
and for the lexical segmentation of garbled phonetic strings using a heuristic search algorithm.
The following sections report the current status of these programs and describe the extension
of the processing algorithms to make use of semantic constraints.

1.  Sentence  Generation 

Work on sentence generation is intended both to provide phonetically transcribed sentences
in sufficient quantity for statistical studies, and to gain insight into the complexity of the semantic
model required to produce reasonable sentences within our restricted linguistic environment.

The program, described in some detail in the previous SATS, has been extended to generate
questions and statements as well as commands. The verb is still chosen first and, if a subject
is needed, the subject and its modifiers are chosen next. Finally, an object (if required) and
its modifiers are selected. A possible subject is a question word ("which," "who," etc.) and
selecting such a subject will cause a question to be generated.
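
The verb-first ordering can be illustrated with the toy generator below. The lexicon, the verb frames, and the function name are invented for the example and are not the report's vocabulary or semantic model.

```python
import random

# Hypothetical verb frames: whether a subject is needed and which objects are allowed.
VERBS = {
    "display": {"needs_subject": False, "objects": ["spectrogram", "waveform"]},
    "contains": {"needs_subject": True, "objects": ["vowel", "fricative"]},
}
SUBJECTS = ["which sentence", "the utterance"]
MODIFIERS = ["", "first ", "last "]
QUESTION_WORDS = ("which", "who")


def generate(rng):
    """Choose the verb first, then a subject if needed, then an object and its modifier."""
    verb = rng.choice(sorted(VERBS))
    frame = VERBS[verb]
    words = []
    is_question = False
    if frame["needs_subject"]:
        subject = rng.choice(SUBJECTS)
        # A question-word subject turns the whole sentence into a question.
        is_question = subject.split()[0] in QUESTION_WORDS
        words.append(subject)
    words.append(verb)
    if frame["objects"]:
        words.append("the " + rng.choice(MODIFIERS) + rng.choice(frame["objects"]))
    return " ".join(words) + ("?" if is_question else ".")


print(generate(random.Random(0)))
```

A verb that needs no subject yields a command, a verb with an ordinary subject yields a statement, and a question-word subject yields a question.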


The program has also been expanded to form complex sentences. The vocabulary has almost
doubled, and is now about 300 words.

2.  Lexical  Segmentation  and  Parsing 

The  last  SATS  described  the  initial  version  of  a  system  that  applied  the  technique  of  heuristic 
search  to  the  problem  of  lexical  segmentation.  Since  that  time,  the  system  has  been  expanded  to 
include  a  semantic  component,  and  major  changes  have  been  made  in  the  lexical  segmentation 
component  (now  called  the  dictionary  module). 

a.  Two-Stage  Strategies 

The conflicting constraints of real-time operation and large computational requirements for
current algorithms have motivated an investigation into the use of two-stage strategies as a means
of using processing time more fruitfully. By a "two-stage" strategy, we mean one that uses cues
(sometimes called "insights" or "heuristics") to access a particularly promising set of possible
answers. A separate algorithm performs the second-stage task of analyzing the selected
subspace in more detail. Such strategies are common in human decision-making.

b.  Dictionary  Module 

The  dictionary  now  contains  300  words  pertaining  to  our  task  domain.  Stored  with  each  word 
is  its  phonemic  transcription,  including  stress  indicators.  For  some  words,  the  dictionary  also 
contains  pointers  into  the  semantic  grammar  (see  below). 

The dictionary words are accessed through a three-level net whose arcs are elements of the
most strongly stressed syllable in the word. This organization was chosen because the stressed
syllable usually contains the most energy and is most likely to be correctly identified. We expect
to augment this net by including access through other reliable indicators such as strong fricatives.

The dictionary module is called once for each stretch of the input which has been determined
to be strongly stressed. The vowel in this stretch and the consonants on either side of it are used
to calculate a set of syllabic classes that are "near" to the one in the input. Other classes are
added by postulating word boundaries before or after the vowel. This set is used to access a
set of dictionary words. Each word is matched with the input by aligning on the stressed syllable
and matching outward. Each word is assigned a score and the words are sorted into decreasing
order according to their score.

This two-stage strategy reduces processing by limiting the expensive matching analysis to
those words most likely to achieve a high score. The more reliable the input from the acoustic
processing, the stricter we will be in defining "nearness," and the fewer words we will have to
consider. There is no reason why nearness cannot be defined dynamically if the acoustic
processing provides "confidence numbers" for each segment.
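
The two-stage lookup can be sketched as follows. Everything specific here is an assumption made for illustration: the three-word lexicon, the key of (preceding consonant, stressed vowel, following consonant) standing in for the three-level net, the confusability table used to define "nearness," and the fraction-of-matching-phonemes score.

```python
from collections import defaultdict
from itertools import product

# Hypothetical lexicon: word -> (phonemes, index of the stressed vowel).
LEXICON = {
    "display": (["D", "IH", "S", "P", "L", "EY"], 5),
    "delay": (["D", "IH", "L", "EY"], 3),
    "spectrum": (["S", "P", "EH", "K", "T", "R", "AH", "M"], 2),
}


def stressed_key(phonemes, stress):
    """(consonant before, stressed vowel, consonant after); '#' marks a word edge."""
    before = phonemes[stress - 1] if stress > 0 else "#"
    after = phonemes[stress + 1] if stress + 1 < len(phonemes) else "#"
    return (before, phonemes[stress], after)


def build_index(lexicon):
    index = defaultdict(list)
    for word, (phonemes, stress) in lexicon.items():
        index[stressed_key(phonemes, stress)].append(word)
    return index


def near_keys(key, confusable):
    """Stage one: expand the observed key; '#' postulates a word boundary on either side."""
    before, vowel, after = key
    befores = {before, "#"} | confusable.get(before, set())
    vowels = {vowel} | confusable.get(vowel, set())
    afters = {after, "#"} | confusable.get(after, set())
    return set(product(befores, vowels, afters))


def match_score(word_ph, word_stress, input_ph, input_stress):
    """Stage two: align on the stressed vowel and count agreeing phonemes outward."""
    hits = 0
    for offset in range(-word_stress, len(word_ph) - word_stress):
        j = input_stress + offset
        if 0 <= j < len(input_ph) and input_ph[j] == word_ph[word_stress + offset]:
            hits += 1
    return hits / len(word_ph)


def lookup(index, lexicon, input_ph, input_stress, confusable):
    """Return (score, word) pairs for the retrieved candidates, best first."""
    candidates = set()
    for key in near_keys(stressed_key(input_ph, input_stress), confusable):
        candidates.update(index.get(key, []))
    scored = [(match_score(*lexicon[w], input_ph, input_stress), w) for w in candidates]
    return sorted(scored, reverse=True)


heard = ["D", "IH", "S", "B", "L", "EY"]  # "display" with /P/ misrecognized as /B/
print(lookup(build_index(LEXICON), LEXICON, heard, 5, {"EY": {"EH"}}))
```

Only words reachable through a near key are scored, so "spectrum" is never matched against the input; a stricter confusability table shrinks the candidate set further.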

The output from the dictionary module is a sequence of ordered sets of words, one set for
each stressed section of the input. The advantage of this technique over the more-common one
of matching hypothesized words individually against the input lies in the fact that the absolute
score of a word is often not as important as its relative score with respect to the alternatives
for this position in the utterance.


c.  Semantic  Grammar 


Semantic information is represented in a semantic grammar. This is a phrase structure
grammar in which syntactic classes such as "adjective" are replaced by semantic classes such
as "spectrogram-descriptor."

We use this formalism to define a series of sentence-forms, each of which is keyed to one
particular type of command or question defined in our domain. The variations specified by the
alternatives in the grammar express both the optional equivalent forms of stating the same
request, and the allowable values for the parameters in the request.

The grammar is constructed so that a successful parse yields a functional representation
of the sentence that can be easily mapped into an internal executable process.

Prior to run time, the grammar is compiled into a list-structured internal representation.
At this time, the entry of each dictionary word that is in initial position in a sentence-form is
modified to point to the head of that sentence-form (it may point to more than one).
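
A minimal sketch of such a grammar, assuming a small invented set of semantic classes and words (none of them taken from the report), shows both the compile-time linking of sentence-initial words to sentence-forms and a parse that returns a nested functional representation:

```python
# Hypothetical sentence-forms; the class names and words are illustrative only.
GRAMMAR = {
    "DISPLAY-CMD": [["display-verb", "the", "display-object"]],
    "display-verb": [["show"], ["display"]],
    "display-object": [["spectrogram"], ["waveform"]],
}
SENTENCE_FORMS = ["DISPLAY-CMD"]

def initial_words(symbol):
    """Words that can begin an expansion of `symbol`."""
    if symbol not in GRAMMAR:
        return {symbol}                      # a terminal word
    words = set()
    for alternative in GRAMMAR[symbol]:
        words |= initial_words(alternative[0])
    return words

def compile_links():
    """Map each sentence-initial word to the sentence-forms it can begin."""
    links = {}
    for form in SENTENCE_FORMS:
        for word in initial_words(form):
            links.setdefault(word, []).append(form)
    return links

def parse(symbol, words, pos=0):
    """Return (functional representation, next position), or None on failure."""
    if symbol not in GRAMMAR:
        if pos < len(words) and words[pos] == symbol:
            return symbol, pos + 1
        return None
    for alternative in GRAMMAR[symbol]:
        parts, p = [], pos
        for child in alternative:
            result = parse(child, words, p)
            if result is None:
                break
            value, p = result
            parts.append(value)
        else:
            return (symbol, *parts), p       # every child matched
    return None
```

The nested tuple returned by `parse` stands in for the functional representation; how the actual system maps it onto an executable process is not described in the text.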

d.  Semantic  Module 

This module uses the output from the dictionary module and the semantic grammar to produce
an hypothesis about the identity of the original spoken sentence. Those words in the first
(leftmost) set produced by the dictionary that can begin sentences are used to access a set of
sentence-forms. Since all variations for each of these forms must be considered as candidates and
the number of variations is usually very large, the heuristic search approach is used to organize
and generate the hypotheses.

The heuristic search paradigm requires that we define an initial set of states, a new-state
generator, an evaluation function, and a goal. In this instance, the initial states are the
sentence-initial words found by the dictionary module. A new state is generated by using the
semantic grammar to add one word to a previous state. The goal is to generate a sentence that
differs from the input by less than some given amount. The evaluation function determines which
state will be used to generate new states. By simply changing some of its coefficients and the level
of error allowed, we can change the behavior of the system from slow but thorough, to fast but
inaccurate, or anyplace in-between.
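
The four ingredients can be illustrated with a small best-first search. This is a sketch under stated assumptions, not the module's implementation: the successor table stands in for the semantic grammar, the error measure is a simple word-by-word mismatch count, and `length_weight` is a hypothetical coefficient of the evaluation function.

```python
import heapq

# Hypothetical successor table standing in for the semantic grammar:
# word -> words that may follow it; None marks a legal end of sentence.
NEXT = {
    "show": ["the"],
    "plot": ["the"],
    "the": ["spectrogram", "waveform"],
    "spectrogram": [None],
    "waveform": [None],
}

def errors(state, heard):
    """Number of positions where the hypothesis disagrees with the input."""
    return sum(1 for i, w in enumerate(state) if i >= len(heard) or heard[i] != w)

def search(initial_words, heard, max_error=1, length_weight=1.0):
    """Best-first search for a complete sentence within `max_error` of `heard`."""
    def evaluate(state):
        # Lower is better. A larger length_weight favours deeper states.
        return errors(state, heard) - length_weight * len(state)

    frontier = [(evaluate((w,)), (w,)) for w in initial_words]
    heapq.heapify(frontier)
    while frontier:
        _, state = heapq.heappop(frontier)
        if errors(state, heard) > max_error:
            continue                          # prune: too far from the input
        for nxt in NEXT[state[-1]]:
            if nxt is None:
                if len(state) == len(heard):
                    return list(state)        # goal reached
            else:
                new = state + (nxt,)
                heapq.heappush(frontier, (evaluate(new), new))
    return None
```

Raising `max_error` admits more hypotheses and makes the search slower but more tolerant of garbled input, which is the kind of adjustment the paragraph above describes.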

Implementation of an initial version of the module has been completed, but much
experimentation will have to be done to evaluate the tradeoff between thoroughness and accuracy.

A major part of the effort in the next few months will be devoted to augmenting the types
of associative links (from words to sentence-forms). The strongest candidates for new links
are verbs in non-initial position. This may require use of the semantic grammar in reverse
mode (right-to-left order) in some cases.

e. Results

Since the development of the acoustic processing module is proceeding in parallel and is not
yet completed, all experiments with the system have been on data produced by a "garbler." The
garbler takes typed sentences and generates phonemic output which has some given degree of
error and ambiguity.

Several sentences have been successfully run through the system consisting of the garbler,
the dictionary module, and the semantic module. The times average about 10 sec, but this is
an unrealistically low figure since only a very small fraction of the total semantic information
has as yet been encoded into the semantic grammar. Moreover, the error level of the garbler
was probably not as high as that to be expected from acoustic processing of actual data.

B.  General  Syntax  Program 

We have implemented a General Syntax Program (GSP) on TX-2 because it appears to be
a good vehicle to study (1) several different parsing strategies within the same general scheme,
and (2) the use of parallel processes in parsing. The basic structure is the same as the GSP
programmed by Ron Kaplan,3 but our emphasis has been on taking advantage of the graphic
capabilities of TX-2. So far, we have run with GSP some very simple transition net type grammars
which were constructed basically to test the program, and a word-finding grammar, also in the
form of a transition net but compiled into a GSP grammar from a dictionary that gives, for each
word, phonemic representations and syntactic categories.

GSP has two basic components: a grammar which specifies how structures are to be built,
and a chart which is used to represent initial structure for the sentence and any structure found
by the grammar. An application of the grammar to a part of the chart (e.g., an attempt to find
a noun phrase beginning with a particular point of the input) is called a process. Since one of
the things a grammar rule may do is to initiate a process, there will generally be several
processes "alive" at any given point of parsing, although (since TX-2 is actually not a parallel
processing machine) only one will be "active" at a time.

GSP could be thought of as a parallel processing machine executing several processes
simultaneously. The grammar could be thought of as the program, the chart as the data which the
program works on, and a process as an instantiation of a program working on a part of the data.
The processes are like subroutines in that one process will start another to obtain information
that it needs, but they differ in that (1) the results are not returned to the calling process, but
placed in a communication port to be read by any other process (including the one that created
it), and (2) a process has a life independent of its creating process and may well continue after
the process that created it disappears (it may have even existed before a process that "creates"
it existed, since an attempt to create a process to do precisely what an already-existing process
is doing results in the use of the existing process).

1. Example of a Chart

The chart represents all structure found thus far and is shown graphically as a series of
boxes, each of which may contain one or more shelves. Each box contains information about a
kind of structure, and each shelf contains information about a particular structure of that kind.
Each shelf has a label, a sister, and any number of other features with values. For example,
Fig. 5 shows the chart in our graphic representation at the beginning of parsing the sentence:

"The  big  bear  eats  sweet  honey." 

Each word of the sentence (we are now assuming typed input, although we may use phonemes
as the basic input) has been placed in a separate box as the only shelf in that box. Each shelf
contains as its label the word and, as its sister, the number of the box that has the next word.
A sister whose value is 0 (e.g., in box 6 of Fig. 5) indicates the last word of the sentence.
Parsing involves placing boxes on top of these input boxes and adding shelves to existing boxes to
express syntactic information about the sentence. Each shelf in such an added box contains
information about structure covering a part of the sentence from the word indicated by the lowest
shelf of the column to just before its sister.
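
The initial chart can be pictured with a small data structure. The sketch below is an assumption about representation (the class and function names are invented here); it only mirrors the label, sister, and feature fields described above.

```python
from dataclasses import dataclass, field

@dataclass
class Shelf:
    label: str
    sister: int                       # box number of the next word; 0 = end of sentence
    features: dict = field(default_factory=dict)

@dataclass
class Box:
    number: int
    shelves: list = field(default_factory=list)

def initial_chart(sentence):
    """Place each word in its own box, as the only shelf in that box."""
    words = sentence.rstrip(".").lower().split()
    chart = {}
    for i, word in enumerate(words, start=1):
        sister = i + 1 if i < len(words) else 0
        chart[i] = Box(i, [Shelf(word, sister)])
    return chart

def span(chart, start, sister):
    """Words covered from box `start` up to, but not including, box `sister`."""
    words, n = [], start
    while n != sister and n != 0:
        shelf = chart[n].shelves[0]
        words.append(shelf.label)
        n = shelf.sister
    return words
```

With this representation, a shelf whose sister is 5 in the column above box 1 covers the first four words, and a sister of 0 covers the whole input.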

Figure 6 shows the chart after a complete parse. Grammatical information relating to a
particular point of the input string is represented in boxes placed in the same column as that
word. The grammatical category represented is shown as the label of the shelf, and the extent
is shown as the sister of the shelf. In looking at the chart, try to see, instead of a number as
the value of the sister feature of each shelf (e.g., "sis 5"), an arrow pointing from that shelf to
the column whose number is the value of "sis." We do not actually display the arrows because
there are so many that the resulting picture would be very confusing.

Looking at the first column of boxes in Fig. 6, note that three boxes have been placed on top
of the original box 1 [box numbers are in the lower-left corner and, except for the input string,
bear no significant relationship to the meaning of the box (the numbers are assigned, like house
numbers on a Tokyo street, in chronological order of creation of the box)]. Box 1 has had
features added to it to specify its status as a determiner (det), adjective (adj), noun (n), beginning
of noun phrase (np), and beginning of sentence (s). The parse has determined that the first word
of the sentence, "the," is a determiner, is not an adjective or noun, and begins a noun phrase and
a sentence. Further, information about its status as a determiner is stored in box 10,
information about the noun phrases it begins is stored in box 9, and information about sentences it
begins is stored in box 8. The boxes referred to are stacked on top of the box they relate to and we
can see that box 8 shows that there are two sentences (s) beginning with "the," the first one covering
the entire input string (since its sister is 0) and the second covering the first four words (its
sister being 5). Note that each sentence is represented by a separate shelf in box 8. Box 9 shows
information about the two noun phrases (np) beginning with "the" in the same manner. The
internal representation of the chart has each np and s shelf contain information about the structure
of the np or s it represents, but this could not be fitted into the graphical representation. The
structures found by the parse could be represented by the trees shown in Fig. 7. (The sentence
that does not cover the entire input string asserts that large people carry food.)

2.  Example  of  a  Grammar 

The grammar is a collection of states, each of which has one or more arcs. It may be
visualized by representing each state as a node in a network, and each arc as a line coming out
of that node and pointing to another node (except for arcs that terminate a network). Figure 4
shows a representation of the grammar that was used to create the chart described above. Each
arc is a little program which is run to determine whether the arc applies and, if it does, what
is to be done.

Figure 8 gives a simplified form of the grammar, which is essentially a transition net. It
says that a sentence is composed of a noun phrase followed by a verb followed by a noun phrase,
and that a noun phrase is composed of an optional determiner followed by any number (including 0)
of adjectives followed by a noun. Figure 8 is a concise description of the language parsed by the
grammar, but it does not show any information about the process structure of the parsing system.
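
The language described by the simplified transition net can be checked with a short recognizer. This sketch says nothing about GSP's process structure; it simply follows every arc of the net. The lexicon and its category assignments are assumptions, chosen so that the example sentence yields the two noun phrases and two sentences beginning with "the" that are discussed above.

```python
# Hypothetical lexicon; category assignments are assumptions for this sketch.
LEXICON = {
    "the": {"det"}, "big": {"adj", "n"}, "bear": {"n", "v"},
    "eats": {"v", "n"}, "sweet": {"adj"}, "honey": {"n"},
}

def np_ends(words, start):
    """All positions where a noun phrase (optional det, adj*, n) from `start` can end."""
    ends = set()
    positions = {start}
    if start < len(words) and "det" in LEXICON[words[start]]:
        positions.add(start + 1)              # take the optional determiner arc
    while positions:
        nxt = set()
        for p in positions:
            if p < len(words):
                cats = LEXICON[words[p]]
                if "n" in cats:
                    ends.add(p + 1)           # the noun arc closes the phrase
                if "adj" in cats:
                    nxt.add(p + 1)            # loop on the adjective arc
        positions = nxt
    return ends

def sentence_ends(words, start=0):
    """All positions where a sentence (np v np) from `start` can end."""
    ends = set()
    for p in np_ends(words, start):
        if p < len(words) and "v" in LEXICON[words[p]]:
            ends |= np_ends(words, p + 1)
    return ends
```

For "the big bear eats sweet honey" the recognizer finds noun phrases ending after two and three words, and sentences ending after four and six words, the shorter one being the "large people carry food" reading.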

The user is not required to write a GSP grammar directly. Rather, he will be able to
express his grammar in a form congenial to the way he thinks about it, and we will provide a
compiler to translate his input into GSP's internal form. In this way, the GSP grammar language
could be thought of as a direct language for a GSP machine with compilers provided from
higher-level languages. Currently, we have a compiler which accepts as input a dictionary giving
words and their phonemic transcriptions, and which produces a GSP grammar to recognize words
from a phonemic input.

Fig. 5. GSP chart at beginning of parsing sentence: "The big bear eats sweet honey."

Fig. 7. Parse-tree representation of GSP chart of Fig. 6.

Fig. 8. Transition net representation of grammar used to obtain GSP chart in Fig. 6.

3.  Processes 

The first process in a parsing is typically an initiator process which then itself initiates a
process to actually begin searching for structure. In our examples, the initiator process starts
at the first box but it could begin anywhere in the chart. Starting a process merely puts some
data about its initial conditions on a process list. In principle, all processes run in parallel,
but since the TX-2 is not a parallel machine there is a control program which determines which
of the processes on the process list is to be run at a given instant. In the current implementation
of GSP, we have arbitrarily picked a particular order, but we hope to explore the effects of
varying the algorithm for determining which process is to be in execution at any given time.

Each process has a number of paths, each of which contains a grammar focus, a chart focus,
and a list of registers. The grammar focus specifies a particular part of the grammar (i.e., state
and arc), the chart focus specifies a place in the chart (i.e., shelf of a box), and each register
specifies some conclusion about the structure the process is heading toward using that path
[e.g., a path in the process looking for a sentence remembers the first noun phrase of the
sentence as the value of the register "subj" (for subject)]. When a path in a process is given
control, it applies the arc of the grammar specified by the grammar focus to the box and shelf
specified by the chart focus.

When a process is initiated, it is necessary to specify the initial state and the name of the
constituent being looked for. The process initiating program will check whether there already
is a process with the same initial state and constituent being looked for initiated from the same
shelf in the chart. If the process is new, then a process number is arbitrarily assigned to the
process, an empty box is created as the process communication port, and a feature is added to
the shelf from which the initiating is done with that box number as its value.
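
The check for an already-existing process amounts to a table lookup keyed on the initial state, the constituent, and the originating shelf. The following sketch is only an illustration of that bookkeeping; the class, its method names, and the use of a list as the communication port are assumptions.

```python
class ProcessTable:
    """Sketch of process initiation with reuse of an existing process."""

    def __init__(self):
        self.processes = {}      # (state, constituent, box, shelf) -> process number
        self.ports = {}          # process number -> list acting as the port box
        self.next_number = 1

    def initiate(self, state, constituent, box, shelf):
        """Return (process number, created?) for the requested process."""
        key = (state, constituent, box, shelf)
        if key in self.processes:
            return self.processes[key], False      # reuse the existing process
        number = self.next_number
        self.next_number += 1
        self.processes[key] = number
        self.ports[number] = []                     # empty communication port
        return number, True

    def post(self, number, result):
        """Place a result in the port instead of returning it to a caller."""
        self.ports[number].append(result)

    def read(self, number):
        """Any process may read what has been posted so far."""
        return list(self.ports[number])
```

Because results are posted to the port rather than returned, a second request for the same process can read results that were produced before the request was made.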

During a parse, the chart is augmented by displaying, for each path, a process information
rectangle inside the shelf that is its chart focus. This rectangle contains information about that
path to enable the user to see which processes are paying attention to particular parts of the
chart. An arrow is displayed to the left of the process information rectangle of the path that is
currently being run.

4.  Plans 

It is our hope that the graphic chart display, with information about each process shown at
the point in the chart that is the process chart focus, will enable the grammar writer to literally
see what is going on when his grammar is applied to the input data. Since the chart will not, in
general, fit on the scope, only part of it may be displayed at a given time. The user has controls
that enable him to specify which parts he wishes to look at. The user also may have some control
over the actual order in which the parsing occurs. He may specify situations in which the
program will halt and he may point to the process he wishes to start next. In the near future, we
plan to extend the ways in which the user may control a parse by allowing him to request
information not shown in the chart.

Currently, processes have no communication with each other other than by what one process
adds to a box being used as a communication port. However, it appears desirable to allow, say,
a syntax process to make a suggestion to a word-finding process about what category of words
are expected. We plan to add this capability in the near future.

The only grammars now running with GSP are simple test grammars (except for the
word-finding grammar which has a vocabulary of 300 words but assumes that there is no error in the
input). The next step is to put some real grammars in and evaluate GSP as a grammar-developing
tool as well as a parsing scheme.

In the course of developing GSP, we developed as tools a list processing system and a tree
display and editing program. These programs may find other applications, and are being
documented for the TX-2 user community. The list processor is a collection of subroutines in BCPL
that allow the construction, reading, and writing of list structures. Even though the list
processing system is different from LISP (our system uses cells of arbitrary numbers of addresses
and reference count reclaiming rather than LISP's garbage collection), it has proved possible to
come close to copying LISP programs directly. The tree display and editor have been
implemented using the list processor and, with a program to convert any list structure to a tree, have
proven to be a valuable aid in debugging programs which build list structure.

C.  LPARS  -  A  Locally  Organized  PARSer  for  Spoken  Input 

LPARS*  has  been  implemented  on  the  TX-2.  It  is  designed  to  process  continuous  speech 
with  the  help  of  syntactic  and  semantic  information. 

The LPARS system differs from traditional parsing methods in that it has no inherent
left-to-right, or right-to-left, bias to its operation. Rather, it allows syntactic structures to be
recognized locally in any part of an utterance. In fact, several parse structures may be built
up simultaneously in different parts of the sentence, and later connected together by searching
for words that might reasonably exist between them.

Thus, words and phrases reliably recognized in any part of a sentence can be used to help
guide the search for further words to complete the sentence.

1. LPARS Task Domain

LPARS'  vocabulary  contains  approximately  70  words.  It  recognizes  a  very  restricted,  but 
linguistically  natural  and  interesting,  subset  of  English.  Its  semantics  are  defined  in  terms  of 
a  particular  scene  -  a  small  two-room  house  containing  people,  furniture,  fixtures,  etc.  -  about 
which  one  may  make  statements,  ask  questions,  tell  a  very  simple-minded  story,  or  command 
the system to manipulate the scene. Sample input sentences are:

"The  coffeetable  on  which  the  ashtray  is  placed  by  Robert  supports  the  dictionary." 

"What does the sidetable support?"

2.  Operation  of  LPARS 

LPARS expects as input a string of phoneme candidates from a front-end phoneme recognizer.
For the present work, the input was prepared by a phonetic scrambler program which simulated
front-end behavior, rather than by a real phoneme recognizer. The input phoneme candidates
may be ambiguous (i.e., several possibilities may be given for one segment). The input is also
expected to contain a fairly large amount of error.

*The system was developed by P. L. Miller as part of a PhD thesis at M.I.T. It will be published
as Lincoln Laboratory Technical Report 503.




[Figure: for the input sentence "THE LARGE COFFEETABLE SUPPORTS THE GREEN DICTIONARY,"
the panels show (1) initial scan: COFFEETABLE, GREEN; (2) local higher distance scans;
(3) partial parse tree construction; (4) connection of partial parse trees; (5) recognized
sentence.]

Fig. 9. Simple example of LPARS in operation.


LPARS operates by first making an initial scan through the sentence looking for longer words
which are not too garbled. The scan returns a list of word candidates which match within a given
error tolerance to specified sections of the input. It is likely that some of these word candidates
are incorrect, and indeed some may overlap.

These word candidates are turned over to the higher-level part of the system which initiates
scans for small words and for more highly garbled words in the areas adjacent to and between
the words found in the initial scan, and groups the words together into parse structures. This
processing is done systematically in an attempt to uncover the entire utterance.

3.  A  Simplified  Example 

Figure 9 is a simplified example of LPARS in operation. The input utterance being processed
is the sentence: "The large coffeetable supports the green dictionary." In real operation, the
sentence would be spoken, analyzed by a front-end phoneme recognizer, and a string of phoneme
candidates produced for input to LPARS. Currently, this input is produced by a front-end
simulator.

LPARS first makes an initial scan through the sentence, testing the words in its vocabulary
against the entire length of the input. This scan is made at a low phonetic distance looking for
longer words which are not too garbled. (Phonetic distance is a measure of how closely the
phonetic spelling of a word matches a section of input.) Let us assume that, in this instance,
the two word candidates "coffeetable" and "green" are found. These word candidates are turned
over to the high-level part of the system, together with a start-of-sentence character (STRCH)
and an end-of-sentence character (ENDCH) which the system adds.
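
A scan of this kind can be sketched as follows. The report does not define phonetic distance, so plain Levenshtein distance is substituted here as a stand-in, letters stand in for phonemes, and each word is compared only against windows of its own length; all of these are simplifying assumptions.

```python
def distance(a, b):
    """Levenshtein distance between two sequences (stand-in for phonetic distance)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def scan(vocabulary, inp, max_distance, min_length=1):
    """Return (word, start, distance) for the best window of each matching word."""
    found = []
    for word in vocabulary:
        if len(word) < min_length:
            continue                       # the initial scan looks only for longer words
        best = None
        for start in range(len(inp) - len(word) + 1):
            d = distance(word, inp[start:start + len(word)])
            if d <= max_distance and (best is None or d < best[2]):
                best = (word, start, d)
        if best:
            found.append(best)
    return found
```

Calling `scan` first with a small `max_distance` and a large `min_length`, and later with a larger `max_distance` on a restricted vocabulary, corresponds to the initial scan and the later higher-distance scans.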

The higher-level analysis consists of three fairly distinct stages, as Fig. 9 indicates. First,
scans at higher phonetic distance are made based on fairly local cues, in the areas adjacent to
the word candidates. In this example, these local scans are very simple. A scan in front of
the noun uncovers the adjective "large" and a further scan in front of that word uncovers the
determiner "the." A scan to the right of the noun for prepositions and verbs fails to find any word
candidates. This means that the verb "supports" is too garbled to be picked up either by the
initial scan or by the higher phonetic distance selective scan. Similar scans around the adjective
"green" uncover the words "dictionary" and "the," but again fail to find the verb.

After these higher-distance scans have taken place, the system builds up as many parse trees
as it can in the sentence. If all the words in the sentence have been found, then the entire
sentence is constructed. Otherwise, the result is a number of partial parse trees (PPTs) in different
parts of the sentence.

The system attempts to construct as many such PPTs as it can with the words it has found.
In this example, it constructs only two PPTs. The first tree, "STRCH the large coffeetable," is
straightforward; the second, "the green dictionary ENDCH," is somewhat unusual since it contains
an "ancestor link" (the link labeled "*"). The ancestor link allows LPARS to recognize that,
due to right recursion, many syntactic relationships are possible but to defer commitment until
later when an attempt is made to join this structure to another structure by proposing words
between them.

The third and final step in the parsing process consists of connecting PPTs to one another
by using the grammar to propose words that might exist between them. Any words proposed are
tested against the input at even higher phonetic distances than the previous scans.




In the present example, the algorithm discovers that the two parse trees can be connected
if a verb is found between them. It therefore initiates a higher-distance scan which succeeds in
finding the verb "supports." Thus, the entire sentence is recognized.

Notice that this example is a simplified one. No erroneous words were found. In a more
realistic example, erroneous words would be found, some local structures containing these words
would be built up, and, additionally, some erroneous local parsings of the correct words could
be constructed. Hopefully, attempts to build out upon the erroneous structures would eventually
prove unsuccessful, while attempts to build out upon the correct structures would usually succeed.

4.  Experimental  Evaluation 

The LPARS system was evaluated by processing 50 sentences. These input sentences were
produced by a front-end simulator which approximated the accuracy of the Vicens-Reddy phoneme
recognizer. The simulator operates roughly as follows: 15 percent of the phonemes are deleted;
10 percent of the remaining phonemes are badly scrambled (i.e., a stop might be changed into a
vowel or fricative); the remaining phonemes are substituted for at random from a restricted
class of phonemes (i.e., a stop might be changed into some other stop chosen at random).
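
The three steps of the simulator can be sketched as below. The phoneme classes are a much-reduced, hypothetical inventory, and the sketch assumes that a same-class substitution may happen to pick the original phoneme; the report does not say whether that is allowed.

```python
import random

# Hypothetical, much-reduced phoneme classes.
CLASSES = {
    "stop": ["p", "t", "k", "b", "d", "g"],
    "fricative": ["f", "s", "sh", "v", "z"],
    "vowel": ["iy", "ih", "eh", "ae", "aa", "uw"],
}
CLASS_OF = {p: c for c, members in CLASSES.items() for p in members}

def garble(phonemes, rng, delete_p=0.15, scramble_p=0.10):
    """Delete, badly scramble, or substitute within class, one phoneme at a time."""
    out = []
    for ph in phonemes:
        if rng.random() < delete_p:
            continue                                    # phoneme deleted
        own = CLASS_OF[ph]
        if rng.random() < scramble_p:
            other = rng.choice([c for c in CLASSES if c != own])
            out.append(rng.choice(CLASSES[other]))      # badly scrambled
        else:
            out.append(rng.choice(CLASSES[own]))        # same-class substitution
    return out
```

A call such as `garble(["s", "t", "aa", "p"], random.Random(0))` yields one garbled rendering; passing a seeded generator keeps an experiment repeatable.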

During the evaluation, if several possible sentences were found, the recognition of the
sentence was considered successful if the correct sentence was included among these. When several
sentences were found, the correct sentence was almost always the best match.

During  evaluation  of  LPARS,  the  50  sentences  were  processed  with  the  following  results: 

23 sentences were correctly recognized from words found only by the initial local
processing.

19 sentences were correctly recognized by the PPT connection algorithm, after
the initial local processing.

3  sentences  were  correctly  recognized  by  fallback  methods. 

2  sentences  resulted  in  incorrect  recognition.  In  both  cases,  the  sentence  found 
differed  from  the  input  sentence  in  a  single  content  word. 

3  sentences  resulted  in  failure  to  find  any  possible  sentence  at  all. 

Thus,  the  overall  success  rate  of  LPARS  with  the  50  input  sentences  was  90  percent.  The 
evaluation  of  LPARS  described  above  primarily  establishes  that,  given  a  simulation  of  a  fairly 
crude  front  end,  the  ideas  embodied  in  LPARS  can  be  made  to  work  acceptably. 
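
The 90 percent figure follows directly from the five outcome counts listed above; the short tally below only restates that arithmetic (the dictionary keys are shorthand labels chosen here).

```python
# Outcome counts as reported for the 50 test sentences.
outcomes = {
    "initial local processing": 23,
    "PPT connection": 19,
    "fallback methods": 3,
    "incorrect recognition": 2,
    "no sentence found": 3,
}
SUCCESSES = ("initial local processing", "PPT connection", "fallback methods")

total = sum(outcomes.values())
correct = sum(outcomes[k] for k in SUCCESSES)
success_rate = 100 * correct / total      # percentage of sentences recognized
```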

III. SPEECH DATA BASE

The speech data base has moved into a period of active use. Data for comparative analysis
have been distributed to other ARPA-supported organizations for use in workshop meetings. The
first such meeting was held at the Speech Communication Research Laboratory in Santa Barbara,
California in March 1973. The second is scheduled for July 1973 at Carnegie-Mellon University.
For the first meeting, we distributed audio tape copies of six utterances to six other organizations
who presented displays of the results of their processing of the data at the meeting. For the
second meeting, copies of 31 utterances have been distributed to twelve other organizations.

Most of the groups received digital tapes, either 10- or 20-kHz sampled data, in the data-base
format. Some groups received audio tapes made from the digital versions. We have already
received indications that the uniformity resulting from digitization will greatly aid in evaluating
the many analysis techniques to be compared at the workshop.




We now have the waveforms for 75 utterances in the data base and are at the practical limit
of on-line storage. We do not expect to add more utterances until the new disk system (see
Sec. IV-B) is available. Current activities to process and label the existing data will use the bulk
of the remaining space.

A.  Speech  Input/Output  Hardware 

The specially designed speech-data input and output system has been through a shakedown
phase as a result of data requirements for the July 1973 workshop. After some initial problems
were corrected, satisfactory performance was achieved. The speech input/output system is
designed to input analog speech data and to provide analog outputs of the processed speech.
Conceptually, this is a simple analog-to-digital and digital-to-analog conversion process. However,
to facilitate the handling of the large quantities of speech data and to meet other requirements,
special-purpose hardware was developed; a block diagram of this hardware is shown in Fig. 10.
One requirement is that two sets of data be obtained for each of the sentences.
One set is considered as normal speech with a channel bandwidth of 5 kHz. The other set is
wideband or high-fidelity speech with a bandwidth of 10 kHz. These two sets of samples must
be synchronized so that different processes can later be correlated. This requirement was met
by feeding the speech through two sets of presampling filters, one of 5-kHz and the other of
10-kHz bandwidth. The filtered signals are then fed to an analog multiplexer that commutates
at a 40-kHz rate. The output of the analog multiplexer is then digitized. One out of every four
conversions is not passed to TX-2, so that both the 5- and 10-kHz speech are sampled at their
Nyquist rates (10 and 20 kHz, respectively). Thus, the samples of the two sets of speech data
are automatically synchronized, and only one pass of the speech is needed to obtain the desired data.
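The arithmetic of the commutation scheme can be checked with a short sketch. The report does not state the order in which the multiplexer visits the two filters, so the four-slot frame used here (wideband, normal, wideband, discarded) is only an assumption chosen because it yields uniformly spaced samples on both channels; the names FRAME, passed_slots, and rate_hz are illustrative.

```python
MUX_RATE_HZ = 40_000  # commutation rate given in the text

# Assumed slot order within one four-slot frame (not stated in the report):
# two slots for the 10-kHz filter, one for the 5-kHz filter, one discarded.
FRAME = ("wide", "narrow", "wide", None)


def passed_slots(n_slots):
    """Return, per channel, the conversion slots that would be passed on."""
    out = {"wide": [], "narrow": []}
    for slot in range(n_slots):
        channel = FRAME[slot % len(FRAME)]
        if channel is not None:  # every fourth conversion is dropped
            out[channel].append(slot)
    return out


def rate_hz(slots, n_slots):
    """Effective sampling rate of one channel over n_slots conversions."""
    return len(slots) * MUX_RATE_HZ / n_slots


one_second = passed_slots(MUX_RATE_HZ)
wide_rate = rate_hz(one_second["wide"], MUX_RATE_HZ)
narrow_rate = rate_hz(one_second["narrow"], MUX_RATE_HZ)
```

Under this assumed frame the wideband channel receives 20,000 samples per second and the normal channel 10,000, and the two streams are offset by a fixed number of slots, which is what makes them synchronized.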

To minimize the amount of storage and processing involved in dealing with silent intervals
adjacent to speech signals, an analog threshold detector is used to detect the presence of speech
and control the conversion process. However, some utterances begin with low-level sounds that
have significant information. Therefore, the speech is fed through the equivalent of an analog
delay mechanism realized by a modified tape recorder. The normal recording head is used as
a pickup head for the threshold detector. The true speech is obtained through the normal playback
channel. Thus, there is a delay of approximately half a second between the thresholding
signal and the speech signal to be sampled. This period proves to be sufficient, and this
arrangement works quite well.
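A digital analogue of this arrangement is a pre-trigger capture: the detector examines the signal ahead of the channel being recorded, so the recording can begin some time before the threshold is crossed. The following is a minimal sketch under that assumption; the function name and parameters are illustrative, and it models only the start of an utterance, not the end-of-speech decision.

```python
def capture_with_lookahead(signal, threshold, delay):
    """Return the samples kept when recording starts `delay` samples
    before the first sample whose magnitude reaches `threshold`.

    `delay` plays the role of the tape-recorder delay between the
    detector pickup and the playback channel.
    """
    for i, x in enumerate(signal):
        if abs(x) >= threshold:
            start = max(0, i - delay)  # keep the low-level onset
            return signal[start:]
    return []  # threshold never reached: nothing is recorded


kept = capture_with_lookahead([0, 0, 0, 1, 2, 9, 8, 0], threshold=5, delay=2)
```

In this example the low-level samples 1 and 2 that precede the first strong sample are retained. With a delay of about half a second, the equivalent look-back at a 10-kHz sampling rate would be roughly 5000 samples.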

One other requirement is to pre-emphasize the high frequencies of the input speech prior
to the analog-to-digital conversion. The choice of a pre-emphasis characteristic involves a
compromise between loss of information of weak signals and distortion of strong signals caused
by either overloading or aliasing. After some experimentation, we have settled on the
pre-emphasis characteristic shown in Fig. 11.

The speech output system de-emphasizes the high frequencies to complement the input
characteristics so that flat overall frequency response is maintained. The combined input and output
frequency response is shown in Fig. 12.
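The idea of complementary pre-emphasis and de-emphasis can be illustrated digitally. The actual TX-2 characteristic is an analog one given only in Fig. 11, so the first-order filter pair and the coefficient 0.9 used here are assumptions made purely for illustration.

```python
def pre_emphasize(x, a=0.9):
    """First-order high-frequency boost: y[n] = x[n] - a * x[n-1]."""
    y, prev = [], 0.0
    for sample in x:
        y.append(sample - a * prev)
        prev = sample
    return y


def de_emphasize(y, a=0.9):
    """Inverse filter: z[n] = y[n] + a * z[n-1]."""
    z, prev = [], 0.0
    for sample in y:
        prev = sample + a * prev
        z.append(prev)
    return z


original = [1.0, 0.5, -0.25, 0.0, 2.0]
restored = de_emphasize(pre_emphasize(original))
```

Because the second filter is the exact inverse of the first, the cascade returns the original samples (up to rounding), which is the discrete counterpart of a flat combined response.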

B. Automatic Labeling

The quantity of data involved with the buildup of the speech data base and the desire for strict
consistency in labeling it indicated the need to automate as much of the labeling task as possible.

A design for an automatic labeler was presented in the last SATS. Essentially, it uses the results
of gross acoustic processing (segmentation and classification into such recognition units as
vowel-like, aspiration, fricative, silence, etc.) to assign the start and end times for the phonemes in
the phonemic transcription of an utterance. With some modifications, the design has been
implemented and is undergoing extensive testing.

Fig. 10. TX-2 speech input/output system. (Block diagram: input, 10-kHz samples, 5-kHz samples, output samples.)

Fig. 11. Frequency response of TX-2 speech input system showing
net effect of pre-emphasis and presampling filters. (Axes: relative gain in dB versus frequency.)

Fig. 12. Combined input and output frequency response for TX-2
speech input/output system. (Axes: relative gain in dB versus frequency.)

The results thus far are good enough to be useful: in the test run, about 80 percent of
the labels have been correctly placed. Better use of predicted durations and increased
knowledge of the idiosyncrasies of the acoustic programs should produce even greater accuracy.

The basic problem involves the recovery from wrong assignments caused by errors in the
acoustic processing. At any point, there are three possible assignments:

(1) Assign the next phoneme to the next acoustic segment.

(2) Assign the next phoneme to the current acoustic segment. (This should be done if
the phoneme was not isolated as a separate acoustic segment, as is often the case
with semivowels.)

(3) Assign the current phoneme to the next acoustic segment. (This should be done if
the acoustic segment is a sub-phonemic event, such as the burst in a plosive.)

Because of the errors and lack of specificity in the current acoustic processing, it is often
not clear at the moment which of the three should be done. Only later does one find that he is on
the wrong track and should return to some earlier decision point to take a different path.

The heuristic search paradigm is used to organize the decision-making. Each state (sequence
of assignments) is assigned a value by an evaluation function, and the best scoring of all
previously generated states is made the current state. Thus, backup is not in any predetermined
sequence, but instead is decided upon dynamically as a function of the goodness of fit of the
alternate hypotheses. Moreover, the evaluation function can actually be a subroutine of any degree
of complexity, and thus can take into account phonological rules, expected durations, and any
knowledge of the types of error most likely to occur in the acoustic processing.
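The three assignment moves and the best-first choice among all generated states can be sketched as follows. This is not the TX-2 program: the table ALLOWED, the cost values, and the penalties are hypothetical stand-ins for the evaluation function, which the report says may be arbitrarily complex. Lower cost is treated as a better score.

```python
import heapq

# Hypothetical table: gross acoustic classes acceptable for each phoneme.
ALLOWED = {
    "s": {"fricative"},
    "t": {"silence", "burst"},
    "a": {"vowel"},
    "w": {"vowel"},
}


def fit_cost(phoneme, seg_class):
    """Assumed goodness-of-fit term: 0 for a plausible pairing, 1 otherwise."""
    return 0.0 if seg_class in ALLOWED.get(phoneme, set()) else 1.0


def align(phonemes, segments, share_penalty=0.5, extend_penalty=0.25):
    """Best-first search; returns (phoneme_index, segment_index) pairs."""
    if not phonemes or not segments:
        return []
    start = (fit_cost(phonemes[0], segments[0]), 0, 0, 0, ((0, 0),))
    heap, counter, seen = [start], 1, set()
    while heap:
        # The best-scoring state generated so far becomes the current state.
        cost, _, p, s, path = heapq.heappop(heap)
        if p == len(phonemes) - 1 and s == len(segments) - 1:
            return list(path)
        if (p, s) in seen:
            continue
        seen.add((p, s))
        moves = (
            (p + 1, s + 1, 0.0),         # (1) next phoneme, next segment
            (p + 1, s, share_penalty),   # (2) next phoneme, current segment
            (p, s + 1, extend_penalty),  # (3) current phoneme, next segment
        )
        for q, t, penalty in moves:
            if q < len(phonemes) and t < len(segments):
                new_cost = cost + penalty + fit_cost(phonemes[q], segments[t])
                heapq.heappush(heap, (new_cost, counter, q, t, path + ((q, t),)))
                counter += 1
    return []


plosive = align(["s", "t", "a"], ["fricative", "silence", "burst", "vowel"])
semivowel = align(["w", "a"], ["vowel"])
```

In the first example the plosive is assigned to both the silence and the burst segments (move 3); in the second, the semivowel and the vowel share a single segment (move 2). Because every generated state stays on the queue, backup to an earlier decision point happens implicitly whenever an older state scores better than the continuations of the current one.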

For  those  phonemes  which  were  not  isolated  as  separate  acoustic  segments,  boundaries  are 
estimated  on  the  basis  of  the  expected  duration  of  the  phoneme,  given  its  context. 
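One simple way to realize such an estimate is to divide a shared acoustic segment among its phonemes in proportion to their expected durations. The report does not say how the estimate is actually formed, so the proportional rule and the function name below are assumptions.

```python
def split_segment(start, end, expected_durations):
    """Divide [start, end] among phonemes in proportion to their
    expected durations; returns one (start, end) pair per phoneme."""
    total = sum(expected_durations)
    if total <= 0:
        raise ValueError("expected durations must sum to a positive value")
    bounds, t = [], start
    for d in expected_durations:
        nxt = t + (end - start) * d / total
        bounds.append((t, nxt))
        t = nxt
    return bounds


# A 300-ms segment shared by a semivowel (expected 50 ms) and a vowel (100 ms).
shared = split_segment(100, 400, [50, 100])
```

Here the internal boundary falls at 200 ms, one third of the way through the segment.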

C.  Data-Base  Software 

Software written specifically for speech research on TX-2 is now quite extensive. While
most of this has been described in earlier reports, substantial improvements are being made or
have been completed for these facilities, in response to needs, suggestions, and reactions of
users. All speech workers on TX-2 use the Speech Processing Controller (SPC). This is an
operating system and command interpreter. New capabilities and improvements have been added
to it over the last year-and-a-half, and it is now extremely well-tailored to the project.

The A/D input received heavy use for the first time, and this resulted in a number of
modifications to the associated software. Validation of the data at several stages was, of course,
essential. Listening to the D/A output was especially helpful for this. Several programs were
written to facilitate data access and D/A of the 10- and 20-kHz waveforms at three stages: after
digitization but before entry into the data base; after entry into the data base; after being written
to tape. To simplify the validation process, the A/D and D/A programs have now been rewritten
to use the sub-directory facility and hence be consistent with other software. Certain critical
pieces have been recoded in machine language to speed up the entire operation.


One of our users is making a systematic study of the effects of phonetic context on vowels.
This work requires a particular CVC syllable to be spoken in a carrier sentence. In response
to the need to get these syllables into the data base, a selective input facility has been built. To
achieve reasonable storage efficiency, a number of individual syllables are collected together
into a "pseudo utterance." This facility allows an operator to input the syllable in its carrier
sentence via the A/D converter, and use the graphic display facility and D/A to select the syllable
of interest. The waveforms for the individual syllables are entered into the data base as separate
fields, until all those that are to be put into a single pseudo utterance have been collected. Then
a single field is constructed containing the concatenated syllables, with silences inserted between
them. The short individual fields are then deleted.
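The final concatenation step can be sketched as follows. The representation of a waveform as a list of integer samples, the use of zero-valued samples for silence, and the returned offset table are all assumptions made for illustration; the report does not describe the field layout.

```python
def build_pseudo_utterance(syllables, silence_len):
    """Concatenate syllable waveforms with `silence_len` zero samples
    between them; also return each syllable's (start, end) offsets."""
    samples, offsets = [], []
    for i, syllable in enumerate(syllables):
        if i > 0:
            samples.extend([0] * silence_len)  # inserted silence
        offsets.append((len(samples), len(samples) + len(syllable)))
        samples.extend(syllable)
    return samples, offsets


field, where = build_pseudo_utterance([[1, 2], [3]], silence_len=2)
```

Keeping the offsets alongside the concatenated field is one way to locate an individual syllable after the short per-syllable fields have been deleted.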

D.  SURNET 

The SURNET server described in the last SATS has been implemented. The current version
only provides the minimum response to a "query" command. After establishing a connection to
SURNET, the user process must transmit a simple parameter list consisting of the query
command code, the entry number, and field type of a field to be retrieved from the data base. If
the search is successful, the field length and the data are transmitted to the user process. If
the requested field cannot be located or problems with the network arise, SURNET closes the
connections. In-house tests have established that this version of SURNET is operating correctly.
We are presently running tests between BBN TENEX and TX-2. Following these tests, the server
will be developed further to accept a complete parameter list instead of the simple two-word list,
to return a complete network header with the data instead of field length alone, and to return
informative error codes in response to system failures rather than close the connections. The
"store" command will be added when space allocation conventions and controls have been
determined.
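The behavior of the current minimal server can be summarized in a sketch. Nothing here reflects the actual SURNET word formats or network calls: the command code value, the dictionary standing in for the data base, and the use of None to mean "close the connections" are all hypothetical.

```python
QUERY = 1  # hypothetical command code; the real value is not given


def handle_request(data_base, command, entry_number, field_type):
    """Return (field_length, data) on success, or None when the
    server would simply close the connections."""
    if command != QUERY:
        return None  # only "query" is supported in the current version
    field = data_base.get((entry_number, field_type))
    if field is None:
        return None  # field not found: no error code is returned yet
    return len(field), field


demo_base = {(7, "waveform"): [12, -3, 5]}
found = handle_request(demo_base, QUERY, 7, "waveform")
missing = handle_request(demo_base, QUERY, 8, "waveform")
```

The planned extensions described above would replace the bare None result with an informative error code and the bare length with a complete network header.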

IV.  SYSTEM  ACTIVITIES 

A.  TSP  System 

A  number  of  spurious  errors  (bad  parity  indications  and  memory  addressing  failures)  led  us 
to  suspect  that  one  of  the  eight  memory  controllers  was  erratic.  It  was  replaced  and  now  the 
hardware  reliability  is  adequate  for  a  production  system.  There  are  still  occasional  failures, 
however,  which  we  are  continuing  to  track  down. 

We are in the process of implementing a text editor for the TSP. It uses the strategy
pioneered by the TX-2 storage scope editor of continuously displaying a page of text around the
current cursor position. Changes to the text or movement of the cursor off the page cause the scope
to be repainted. The heart of the editor has been implemented, allowing us to edit text stored
in the TSP core memory. The interfaces to the TSP disk and the ARPA network are being worked
out.

The basic disk-handling routines have been completed, allowing us to write, read, and delete
blocks of data on the disk. These routines keep track of which areas of the disk are used and
which are free. Data blocks are identified by 16-bit numbers (actually, the track and sector
address of the beginning of the data). We are working on a file system which will allow us to
reference the data with character string names.
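A 16-bit identifier built from a track and sector address might be packed as shown here. The report does not say how the 16 bits are divided, so the 5-bit sector field is purely an assumption, as are the function names.

```python
SECTOR_BITS = 5  # assumed width of the sector field; not given in the report


def pack_block_id(track, sector, sector_bits=SECTOR_BITS):
    """Combine a track and sector address into one 16-bit block number."""
    track_bits = 16 - sector_bits
    if not 0 <= track < (1 << track_bits):
        raise ValueError("track out of range")
    if not 0 <= sector < (1 << sector_bits):
        raise ValueError("sector out of range")
    return (track << sector_bits) | sector


def unpack_block_id(block_id, sector_bits=SECTOR_BITS):
    """Recover (track, sector) from a 16-bit block number."""
    return block_id >> sector_bits, block_id & ((1 << sector_bits) - 1)


block = pack_block_id(3, 7)
```

A file system with character string names would then amount to a directory mapping each name to one such block number.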




B.  TX-2  System 

A number of changes have been made to TX-2 hardware and software in order to provide
support for the various current areas of activity.

Drum space allocated on a fixed basis for dedicated-machine projects was reorganized and
drastically reduced. The reclaimed space, having a capacity for about 3 million TX-2 words,
is now being used for the Speech Data Base.

The magnetic-tape facility was expanded by the addition of a fourth tape unit.

In designing for the incorporation of the disk into the APEX time-sharing system, a study
of file size distribution on the drum was performed. Results revealed that while the mean file
size is eight pages, a disproportionate number of one-page files (about 23 percent of all files)
reduces the median file size to four pages.

Early design work was done on an accounting system to be implemented in APEX. This
system would maintain running records of the usage of system resources by the various projects.

The IBM 3830/3330 disk memory has been delivered and is currently being installed on
TX-2. The TX-2 I/O channel adapter which emulates an IBM block multiplexer channel is also
being completed, and operation of the disk on TX-2 using this channel adapter should begin within
the next month.

The TX-2 cycle stealing I/O processor has been operating on a separate main memory bus
port for several months with nearly maximal overlap with the TX-2 CPU. Accesses to main
memory by the eight I/O processor channels are mapped through the same virtual memory
mechanism used by the CPU. This mechanism has been somewhat restructured to simplify system
I/O programming, and the first enhancement of channel programming capabilities is to allow
channel programs to manipulate directly the contents of the mapping memories in SPAT.

C.  TX-2  Password  Scheme 

In many computer operating systems, a user authenticates himself by typing a secret
password known only to himself and to the system. The system compares this password with one
recorded in a Password Table which is available only to the authentication program. The
integrity of the system depends on keeping the table secret. The TX-2 operating system lacks
mechanisms for insuring such secrecy. We have developed a password scheme which does not require
such secrecy and which is, therefore, better suited to the open environment of TX-2. All aspects of
the system, including all relevant code and data bases, may be known by anyone attempting to
intrude.

The scheme is based on using a function H which the would-be intruder is unable to invert.
This function is applied to the user's password and the result compared with a table entry, a
match being interpreted as authentication of the user. The intruder may know both H and the
table, but he can penetrate the system only if he can invert H to determine an input that produces
a given output.
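The comparison step can be sketched in a few lines. The function H actually proposed for TX-2 is not described in this report; SHA-256 from the Python standard library is substituted here only as a familiar example of a function believed hard to invert, and the table contents are hypothetical.

```python
import hashlib
import hmac


def H(password):
    """Stand-in one-way function (SHA-256); not the TX-2 function."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# The table holds only H(password) and may be readable by anyone.
PASSWORD_TABLE = {"user1": H("open sesame")}  # hypothetical entry


def authenticate(user, password, table=PASSWORD_TABLE):
    """Apply H to the typed password and compare with the table entry."""
    stored = table.get(user)
    if stored is None:
        return False
    return hmac.compare_digest(stored, H(password))


accepted = authenticate("user1", "open sesame")
rejected = authenticate("user1", "wrong guess")
```

Since the table never contains the password itself, reading it gives an intruder only the output of H, which is the property the scheme relies on.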

A paper has been written for submission as a journal article which discusses the issues
surrounding selection of a suitable H. A plausible argument is given that penetration would be
exceedingly difficult. Apparently, no more rigorous result can be obtained: It appears that any
analysis adequate to "prove" the scheme would also lead to enough knowledge to penetrate it.

We  plan  to  implement  the  scheme  for  TX-2  password  protection  in  the  near  future. 




REFERENCES 


A. N. Stowe, W. P. Harris and D. B. Hampton, "Signal and Context Components
of Word-Recognition Behavior," J. Acoust. Soc. Am. 35, 639-644 (1963).

Speech Semiannual Technical Summary, Lincoln Laboratory, M.I.T.
(30 November 1972), DDC AD-754940.

R. Kaplan, "A General Syntactic Processor," in Natural Language Processing,
R. Rustin, Ed. (Algorithmics Press, New York, 1973), pp. 193-241.


